Skip to content

Remove old Spark 3.3.1 through 3.4 shim sources#15036

Open
gerashegalov wants to merge 1 commit into
codex/unshim-stack-02t-shim-cleanup-330-dbrfrom
codex/unshim-stack-02u-shim-cleanup-334
Open

Remove old Spark 3.3.1 through 3.4 shim sources#15036
gerashegalov wants to merge 1 commit into
codex/unshim-stack-02t-shim-cleanup-330-dbrfrom
codex/unshim-stack-02u-shim-cleanup-334

Conversation

@gerashegalov

@gerashegalov gerashegalov commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Related to #14834.

Description

This PR is one reviewable layer in the unshim stack introduced by #15025. It removes old Spark 3.3.1 through Spark 3.4 shim sources that are now provided by shared helpers or the new helper modules.

Stack context

Testing and validation notes

  • No standalone behavior change is intended in this layer. It is covered by the full-stack packaging/build validation described in Add default common unshim packaging flow #15025 and the existing tests for the affected subsystem.
  • The full split stack was verified to be tree-equivalent to the pre-split stack top.

Checklists

Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

Testing

  • Added or modified tests to cover new code paths
  • Covered by existing tests
    (Covered by the validation notes in the PR description.)
  • Not required

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02t-shim-cleanup-330-dbr branch from 49480ce to c0d071d Compare June 10, 2026 20:49
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02u-shim-cleanup-334 branch from eb17bcf to 81c9244 Compare June 10, 2026 20:49
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02t-shim-cleanup-330-dbr branch from c0d071d to c6d9166 Compare June 10, 2026 21:13
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02u-shim-cleanup-334 branch 4 times, most recently from e4f3ab7 to cc57ebc Compare June 10, 2026 22:20
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02t-shim-cleanup-330-dbr branch 2 times, most recently from 5febf4b to 7bb6083 Compare June 10, 2026 22:37
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02u-shim-cleanup-334 branch from cc57ebc to 92c28d1 Compare June 10, 2026 22:37
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02t-shim-cleanup-330-dbr branch from 7bb6083 to 796810e Compare June 10, 2026 22:41
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02u-shim-cleanup-334 branch 2 times, most recently from 0bcc39d to 78e3755 Compare June 10, 2026 22:46
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02t-shim-cleanup-330-dbr branch from 2663bb3 to d746dd5 Compare June 10, 2026 22:59
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02u-shim-cleanup-334 branch from 78e3755 to a43841a Compare June 10, 2026 22:59
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02t-shim-cleanup-330-dbr branch from d746dd5 to cd226f8 Compare June 10, 2026 23:12
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02u-shim-cleanup-334 branch 2 times, most recently from 721a940 to d1cfd5c Compare June 10, 2026 23:15
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02t-shim-cleanup-330-dbr branch 2 times, most recently from e4d544e to 7ee72ed Compare June 10, 2026 23:29
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02u-shim-cleanup-334 branch 2 times, most recently from 4d278cb to c06aa0a Compare June 10, 2026 23:33
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02t-shim-cleanup-330-dbr branch from 7ee72ed to 3d5e85e Compare June 10, 2026 23:33
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02u-shim-cleanup-334 branch from c06aa0a to 9780ace Compare June 10, 2026 23:48
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02t-shim-cleanup-330-dbr branch 2 times, most recently from 47cc5a9 to c2f0e73 Compare June 10, 2026 23:59
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02u-shim-cleanup-334 branch from 9780ace to 278ab3d Compare June 10, 2026 23:59
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02t-shim-cleanup-330-dbr branch from c2f0e73 to 3c2a442 Compare June 11, 2026 00:25
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02u-shim-cleanup-334 branch from 278ab3d to 4654034 Compare June 11, 2026 00:25
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02t-shim-cleanup-330-dbr branch from 3c2a442 to f27a5af Compare June 11, 2026 00:37
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02u-shim-cleanup-334 branch from 4654034 to 0089b56 Compare June 11, 2026 00:37
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02t-shim-cleanup-330-dbr branch from f27a5af to df68682 Compare June 11, 2026 00:51
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02u-shim-cleanup-334 branch from 0089b56 to c546023 Compare June 11, 2026 00:51
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02t-shim-cleanup-330-dbr branch from df68682 to 9d8e2a1 Compare June 11, 2026 01:18
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02u-shim-cleanup-334 branch from c546023 to a72346b Compare June 11, 2026 01:18
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02t-shim-cleanup-330-dbr branch from 9d8e2a1 to d8581c4 Compare June 11, 2026 01:32
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02u-shim-cleanup-334 branch from a72346b to 873c175 Compare June 11, 2026 01:32
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02t-shim-cleanup-330-dbr branch from d8581c4 to 9973398 Compare June 11, 2026 01:43
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02u-shim-cleanup-334 branch from 873c175 to 550ff06 Compare June 11, 2026 01:43
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02t-shim-cleanup-330-dbr branch from 9973398 to 9bb4096 Compare June 11, 2026 01:58
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02u-shim-cleanup-334 branch 2 times, most recently from 17f3131 to a735370 Compare June 11, 2026 02:26
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02t-shim-cleanup-330-dbr branch from 9bb4096 to 903afa7 Compare June 11, 2026 02:26
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02u-shim-cleanup-334 branch from a735370 to 2feeaf8 Compare June 13, 2026 12:13
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02t-shim-cleanup-330-dbr branch from 903afa7 to 0b03989 Compare June 13, 2026 12:13
Signed-off-by: Gera Shegalov <gshegalov@nvidia.com>
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02t-shim-cleanup-330-dbr branch from 0b03989 to 9103bb7 Compare June 13, 2026 12:20
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02u-shim-cleanup-334 branch from 2feeaf8 to bb08a89 Compare June 13, 2026 12:20
@gerashegalov gerashegalov marked this pull request as ready for review June 13, 2026 12:49
@greptile-apps

greptile-apps Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR is one layer in a broader "unshim stack" that consolidates version-specific Spark shim sources into shared helpers. It removes duplicate shim sources for Spark 3.3.1–3.4.x that are now provided by new helper modules introduced in the preceding stack layer.

  • 11 SparkShimServiceProvider boilerplate files deleted (spark331–spark344, including Databricks variants); their version registration is now handled by shared infrastructure from the previous stack layer.
  • Several classes migrated off Spark's internal Logging trait (RapidsShuffleIterator, RapidsCachingReader, HiveFileUtil, GpuWindowGroupLimitingIterator) in favour of direct SLF4J, which is required when source files are compiled into multiple shim JARs without access to Spark internals.
  • OrcProtoWriterShim refactored to reflection to support both org.apache.orc.protobuf and com.google.protobuf variants across ORC library generations; GpuGroupedPythonRunnerFactory converted from case class to class extends Serializable with explicit argNames parameter; blank lines added to Python runner files for binary-dedupe alignment.

Confidence Score: 4/5

The PR is a structural cleanup with no intended behavioral change; the full-stack build was validated tree-equivalent to the pre-split stack, so merging carries low risk.

The deletions are mechanical and the refactorings (Logging → SLF4J, WriteFilesExecRule delegation, shim provider removal) are straightforward. The OrcProtoWriterShim reflection rewrite is the most complex change: it is logically correct but surfaces an opaque error message when neither protobuf library is resolvable at runtime. The GpuGroupedPythonRunnerFactory API break (removed argNames default) is an internal shim concern that should be covered by the full-stack build.

sql-plugin/src/main/spark340/scala/com/nvidia/spark/rapids/shims/OrcProtoWriterShim.scala deserves a second look for the reflection-based proto dispatch and its error-path diagnostics.

Important Files Changed

Filename Overview
sql-plugin/src/main/spark340/scala/com/nvidia/spark/rapids/shims/OrcProtoWriterShim.scala Refactored to use reflection to support both org.apache.orc.protobuf and com.google.protobuf variants; introduces mutable per-instance state and a potentially misleading error message when no proto library is found
sql-plugin/src/main/spark332db/scala/com/nvidia/spark/rapids/shims/WriteFilesExecRule.scala Renamed/replaced from SparkUpgradeExceptionShims.scala; provides WriteFilesExec GPU rule shared across many shim versions; copyright year truncated to single year 2026
sql-plugin/src/main/spark341db/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuGroupedPythonRunnerFactory.scala Changed from case class to class with explicit Serializable; removed default argNames=None, which is a breaking API change for callers not explicitly passing argNames
sql-plugin/src/main/spark340/scala/com/nvidia/spark/rapids/shuffle/RapidsShuffleIterator.scala Migrated from Spark internal Logging trait to direct SLF4J; only single-argument log helpers are added (no throwable overloads)
sql-plugin/src/main/spark340/scala/org/apache/spark/sql/rapids/RapidsCachingReader.scala Migrated from Spark internal Logging trait to direct SLF4J; only logInfo and logDebug helpers are added
sql-plugin/src/main/spark341db/scala/com/nvidia/spark/rapids/shims/GpuWindowGroupLimitExec.scala Removed unused Logging trait import and mixin from GpuWindowGroupLimitingIterator; no log calls present in the class so removal is clean
sql-plugin/src/main/spark341db/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuArrowPythonRunner.scala Added blank lines to align executable line numbers with pre-Spark-4 shims for binary deduplication; intentional and documented in comment
sql-plugin/src/main/spark332db/scala/org/apache/spark/sql/hive/rapids/shims/HiveFileUtil.scala Removed Spark internal Logging mixin and replaced with direct SLF4J; only logWarning is used in the file and is correctly replicated
sql-plugin/src/main/spark331/scala/com/nvidia/spark/rapids/shims/Spark331PlusNonDBShims.scala CheckOverflowInTableInsert expression registration delegated to shared CheckOverflowInTableInsertShims.exprs helper
sql-plugin/src/main/spark340/scala/com/nvidia/spark/rapids/shims/Spark340PlusNonDBShims.scala WriteFilesExec rule registration delegated to WriteFilesExecRule.execs; CreateDataSourceTableAsSelectCommand uses new runnableCmdFromShim helper

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["SparkShimServiceProvider (deleted per version)\nspark331 / 332 / 332db / 333 / 334 / 340 / 341 / 341db / 342 / 343 / 344"]
    B["Shared helper module\n(previous stack layer #15053)"]
    A -->|"replaced by"| B

    C["Spark.internal.Logging (removed)"]
    D["Direct SLF4J logger"]
    C -->|"replaced by"| D
    C --> RapidsShuffleIterator
    C --> RapidsCachingReader
    C --> HiveFileUtil
    C --> GpuWindowGroupLimitingIterator
    RapidsShuffleIterator --> D
    RapidsCachingReader --> D
    HiveFileUtil --> D
    GpuWindowGroupLimitingIterator --> D

    E["OrcProtoWriterShim (static CodedOutputStream)"]
    F["OrcProtoWriterShim (reflection over protoApis)\nSupports org.apache.orc.protobuf + com.google.protobuf"]
    E -->|"refactored to"| F

    G["WriteFilesExecRule (new shared object)\nWriteFilesExecShims.exec → GpuWriteFilesMeta"]
    H["Spark332PlusDBShims.getExecs"]
    I["Spark340PlusNonDBShims.getExecs"]
    G --> H
    G --> I
Loading

Comments Outside Diff (1)

  1. sql-plugin/src/main/spark340/scala/com/nvidia/spark/rapids/shims/OrcProtoWriterShim.scala, line 557-565 (link)

    P2 Misleading error when no protobuf library is loadable

    If both org.apache.orc.protobuf and com.google.protobuf are unavailable at runtime (e.g., due to a misconfigured classpath), protoApis will be empty and every call to writeAndFlush will throw IllegalArgumentException("Unexpected protobuf message type: …"). That message implies the object is the wrong type, when the real problem is that neither codec library could be loaded. It would improve diagnostics to assert protoApis.nonEmpty at initialization (or inside apply) and surface a clear "no protobuf implementation found on classpath" error early.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Reviews (1): Last reviewed commit: "Remove old Spark 3.3.1 through 3.4 shim ..." | Re-trigger Greptile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants